TM-LDA: Efficient Online Modeling of the Latent Topic Transitions in Social Media
Authors
Abstract
Latent topic analysis has emerged as one of the most effective methods for classifying, clustering and retrieving textual data. However, existing models such as Latent Dirichlet Allocation (LDA) were developed for static corpora of relatively large documents. In contrast, much of the textual content on the web, and especially social media, is temporally sequenced, and comes in short fragments such as Tweets, Facebook status updates, or comments on YouTube. In this paper we propose a novel topic model, Temporal-LDA or TM-LDA, for efficiently mining streams of social text such as a Twitter stream for an author, by modeling the topics and topic transitions that naturally arise in such data. TM-LDA learns the transition parameters among topics by minimizing the prediction error on the topic distribution of subsequent postings. After training, TM-LDA is thus able to accurately predict the expected topic distribution in future posts. To make these predictions practical in a realistic online prediction setting, we develop an efficient updating algorithm that adjusts the transition parameters as new documents stream in. Our empirical results on a corpus of more than 30 million Twitter posts show that TM-LDA significantly outperforms state-of-the-art static LDA models for estimating the topic distribution of new documents over time. We also demonstrate how TM-LDA is able to highlight interesting variations of common patterns of behavior across different cities, such as differences in the work-life rhythm of cities, and factors responsible for area-specific problems and complaints.
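The abstract's central idea, learning topic transition parameters by minimizing the prediction error on the next post's topic distribution, can be illustrated as an ordinary least-squares problem. The following is a minimal sketch under stated assumptions, not the paper's implementation: the names `fit_transition`, `predict_next`, and `OnlineTransition` are hypothetical, the data is synthetic, and the incremental update shown (accumulating the normal equations) is just one simple way to refresh the parameters as new posts arrive, not necessarily the updating algorithm the authors developed.

```python
import numpy as np

def fit_transition(X, Y):
    """Return T minimizing ||X @ T - Y||_F (ordinary least squares).

    X: (n, K) topic distributions of posts at time t (assumed from LDA inference).
    Y: (n, K) topic distributions of the same authors' next posts.
    """
    T, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return T

def predict_next(x, T):
    """Predict the next topic distribution, clipped and renormalized to sum to 1."""
    y = np.clip(x @ T, 0.0, None)
    s = y.sum()
    return y / s if s > 0 else np.full_like(y, 1.0 / y.size)

class OnlineTransition:
    """Hypothetical incremental variant: keeps running sums X^T X and X^T Y."""

    def __init__(self, k):
        self.A = np.zeros((k, k))  # accumulates X^T X
        self.B = np.zeros((k, k))  # accumulates X^T Y

    def update(self, X_new, Y_new):
        self.A += X_new.T @ X_new
        self.B += X_new.T @ Y_new

    def transition(self):
        # Least-squares solve of A T = B; tolerates a rank-deficient A.
        T, *_ = np.linalg.lstsq(self.A, self.B, rcond=None)
        return T

# Synthetic check: generate posts from a known row-stochastic transition matrix.
rng = np.random.default_rng(0)
K = 4
T_true = rng.dirichlet(np.ones(K), size=K)
X = rng.dirichlet(np.ones(K), size=200)
Y = X @ T_true

T_hat = fit_transition(X, Y)

online = OnlineTransition(K)
online.update(X[:100], Y[:100])   # first batch of posts
online.update(X[100:], Y[100:])   # later batch streams in
T_online = online.transition()

next_topics = predict_next(X[0], T_hat)
```

In this noise-free setting both the batch and the incremental estimates recover the generating matrix; with real topic distributions the fit would only approximate the transitions, and the clipping in `predict_next` is an assumption made here to keep predictions on the probability simplex.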
Similar resources
Health Monitoring on Social Media over Time
Social media has become a major source for analyzing all aspects of daily life. Thanks to dedicated latent topic analysis methods such as the Ailment Topic Aspect Model (ATAM), public health can now be observed on Twitter. In this work, we are interested in monitoring people’s health over time. Recently, Temporal-LDA (TM-LDA) was proposed for efficiently modeling general-purpose topic transitio...
Automatic keyword extraction using Latent Dirichlet Allocation topic modeling: Similarity with golden standard and users' evaluation
Purpose: This study investigates the automatic keyword extraction from the table of contents of Persian e-books in the field of science using LDA topic modeling, evaluating their similarity with golden standard, and users' viewpoints of the model keywords. Methodology: This is a mixed text-mining research in which LDA topic modeling is used to extract keywords from the table of contents of sci...
Mining Information from Heterogeneous Sources: A Topic Modeling Approach
In recent years, the phenomenal growth and popularity of social media, news and discussion websites has led to a vast number of information sources available online. These sources generate massive amounts of real-time content on a daily basis making it increasingly difficult to glean true and useful information from them. Automatically categorizing and compressing important contextual informati...
Modeling virtual organizations with Latent Dirichlet Allocation: A case for natural language processing
This paper explores a variety of methods for applying the Latent Dirichlet Allocation (LDA) automated topic modeling algorithm to the modeling of the structure and behavior of virtual organizations found within modern social media and social networking environments. As the field of Big Data reveals, an increase in the scale of social data available presents new challenges which are not tackled ...
Deep Belief Nets for Topic Modeling (Workshop on Knowledge-Powered Deep Learning for Text Mining, KPDLTM-2014)
Applying traditional collaborative filtering to digital publishing is challenging because user data is very sparse due to the high volume of documents relative to the number of users. Content-based approaches, on the other hand, are attractive because textual content is often very informative. In this paper we describe large-scale content-based collaborative filtering for digital publishing. To ...